Use YAML counter_defs.yaml as single source of truth for all GFX metrics#79

Open
mehdi-saeedi wants to merge 2 commits into main from yaml-metrics-all-gfx

mehdi-saeedi (Collaborator) commented Mar 20, 2026

Summary

  • YAML-driven metrics for all architectures: Move all metric definitions from Python @metric decorators in gfx942.py and gfx90a.py into counter_defs.yaml, unifying the approach already used by RDNA (gfx1201/gfx1151) to cover CDNA (gfx942/gfx90a). Remove ~1,200 lines of decorator-based metric code.
  • Dynamic device specs from rocminfo/rocm-smi: Replace hardcoded DeviceSpecs with runtime queries (rocminfo for CU count, wavefront size, L2 cache, clocks; rocm-smi for memory clock). Fixes 5 incorrect MI210 values that were using MI250X numbers (HBM bandwidth, L2 size, FLOPS).
  • Cleaner base class: _BUILTIN_EXPRESSION_VARS is now derived from DeviceSpecs fields via dataclasses.fields() instead of a manually maintained static set. YAML expression namespace injection is also dynamic.

Changes

| File | What changed |
| --- | --- |
| counter_defs.yaml | +18 CDNA metrics with architecture-specific expressions; RDNA metrics renamed to dotted convention |
| device_info.py | New: rocminfo/rocm-smi parser + fallback specs table with AMD source links |
| base.py | Dynamic expression var discovery; removed unused DeviceSpecs fields (l2_bandwidth_gbs, fp32_tflops, fp64_tflops, int8_tops) |
| gfx942.py / gfx90a.py | Removed all @metric methods, now use query_device_specs() |
| test_backend_metrics.py | Updated to call metrics via YAML compute path |
| test_error_handling.py | Same pattern update |
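The device_info.py row describes a runtime query backed by a static fallback table. The following is a minimal sketch of that pattern, not the PR's implementation: the function names, the fallback values, and the assumed "Key: value" layout of rocminfo output (a GPU agent's Name line followed by Compute Unit, Wavefront Size, and L2 lines) are all illustrative assumptions.

```python
import re
import subprocess

# Hypothetical fallback table; the PR's device_info.py maintains its own.
FALLBACK_SPECS = {"gfx90a": {"num_cu": 104, "wavefront_size": 64, "l2_size_mb": 8.0}}


def parse_rocminfo_text(text: str, arch: str) -> dict:
    """Extract a few fields for the agent named `arch` (output format assumed)."""
    # Skip ahead to the agent whose Name line is exactly the arch, so that
    # CPU agents listed earlier in the output are ignored.
    start = re.search(rf"^[ \t]*Name:[ \t]*{re.escape(arch)}[ \t]*$", text, re.MULTILINE)
    if start is None:
        return {}
    text = text[start.end():]
    patterns = {
        "num_cu": (r"^\s*Compute Unit:\s*(\d+)", int),
        "wavefront_size": (r"^\s*Wavefront Size:\s*(\d+)", int),
        "l2_size_mb": (r"^\s*L2:\s*(\d+)", lambda kb: int(kb) / 1024),  # KB -> MB
    }
    specs = {}
    for key, (pattern, convert) in patterns.items():
        match = re.search(pattern, text, re.MULTILINE)
        if match:
            specs[key] = convert(match.group(1))
    return specs


def query_specs(arch: str) -> dict:
    """Start from the static table, then overlay whatever rocminfo reports."""
    specs = dict(FALLBACK_SPECS.get(arch, {}))
    try:
        out = subprocess.run(["rocminfo"], capture_output=True, text=True,
                             check=True, timeout=10).stdout
    except (OSError, subprocess.SubprocessError):
        return specs  # rocminfo missing or failed: static values only
    specs.update(parse_rocminfo_text(out, arch))
    return specs
```

Overlaying live values on top of the static table means a machine without rocminfo, or with a different GPU installed, still gets the fallback numbers for the requested architecture.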

Bugs fixed

| Field | Old (hardcoded) | New (correct) | Source |
| --- | --- | --- | --- |
| MI210 hbm_bandwidth_gbs | 3200 | 1600 GB/s | AMD MI210 spec |
| MI210 l2_size_mb | 16 | 8 MB | rocminfo live query |
| MI210 fp32_tflops | 47.9 | (removed, unused) | |
| MI210 fp64_tflops | 47.9 | (removed, unused) | |
| MI210 int8_tops | 383 | (removed, unused) | |

Test plan

  • All 127 unit tests pass on MI210 (gfx90a)
  • gfx942 tests pass via static fallback (arch mismatch path)
  • Live profiling verified on MI210 — metrics compute correctly with corrected specs
  • Verify on MI300X hardware

Move all metric definitions from Python @metric decorators in gfx942.py
and gfx90a.py into counter_defs.yaml. This unifies the approach already
used by RDNA (gfx1201/gfx1151) to cover CDNA (gfx942/gfx90a) as well.

Changes:
- base.py: extend YAML expression namespace with device spec variables
  (BASE_CLOCK_MHZ, HBM_BANDWIDTH_GBS, DURATION_US, etc.) and support
  unsupported_reason field in YAML definitions
- counter_defs.yaml: add 18 CDNA metrics with architecture-specific
  expressions; rename RDNA metrics to dotted convention (GPU_UTILIZATION
  -> compute.gpu_utilization, L2_HIT_RATE merged into memory.l2_hit_rate)
- gfx942.py/gfx90a.py: remove all @metric methods (~1100 lines), keep
  only infrastructure (device specs, counter groups, rocprof invocation)
- Update tests to use YAML metric computation path
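
To make the YAML compute path more concrete, here is a hedged sketch of how a per-architecture expression might be evaluated against a namespace that merges counter values with device spec variables. The dictionary stands in for a parsed counter_defs.yaml; its key layout, the HBM_BYTES counter, the memory.hbm_utilization metric, and the compute_metric helper are hypothetical. Only memory.l2_hit_rate, unsupported_reason, HBM_BANDWIDTH_GBS, and DURATION_US are names taken from this PR.

```python
# Hypothetical shape of parsed counter_defs.yaml entries; the real schema may differ.
METRIC_DEFS = {
    "memory.l2_hit_rate": {
        "gfx90a": {"expression": "100 * TCC_HIT_sum / (TCC_HIT_sum + TCC_MISS_sum)"},
        "gfx1201": {"unsupported_reason": "counters not available in this sketch"},
    },
    "memory.hbm_utilization": {
        # HBM_BYTES is an illustrative counter name, not a real rocprof counter.
        "gfx90a": {"expression": "100 * (HBM_BYTES / (DURATION_US * 1e-6)) / (HBM_BANDWIDTH_GBS * 1e9)"},
    },
}


class UnsupportedMetric(Exception):
    """Raised when a definition carries an unsupported_reason for the arch."""


def compute_metric(name: str, arch: str, counters: dict, spec_vars: dict) -> float:
    entry = METRIC_DEFS[name][arch]
    if "unsupported_reason" in entry:
        raise UnsupportedMetric(entry["unsupported_reason"])
    # Counters and device spec variables share one expression namespace.
    namespace = {**spec_vars, **counters}
    # Builtins are emptied to limit name lookup; this is not a security sandbox.
    return eval(entry["expression"], {"__builtins__": {}}, namespace)
```

For example, compute_metric("memory.l2_hit_rate", "gfx90a", {"TCC_HIT_sum": 75, "TCC_MISS_sum": 25}, {}) evaluates the gfx90a expression, while requesting the same metric for gfx1201 raises UnsupportedMetric with the stored reason.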

Made-with: Cursor
mehdi-saeedi force-pushed the yaml-metrics-all-gfx branch from 93f5eda to 9b09864 on March 20, 2026 16:07
mehdi-saeedi requested a review from mawad-amd on March 20, 2026 16:49
…tion

Remove redundant method overrides (get_metric_counters, get_required_counters,
compute_metric_stats) that duplicated base class logic. Both RDNA4 backends
now follow the same pattern as gfx942/gfx90a.

Made-with: Cursor
mehdi-saeedi self-assigned this on Mar 20, 2026